Abstract

Algorithms for hyperparameter optimization abound, all of which work well under different and often unverifiable assumptions. Motivated by the general challenge of sequentially choosing which algorithm to use, we study the more specific task of choosing among distributions to use for random hyperparameter optimization. This work is naturally framed in the extreme bandit setting, which deals with sequentially choosing which distribution from a collection to sample in order to minimize (maximize) the single best cost (reward). Whereas the distributions in the standard bandit setting are primarily characterized by their means, a number of subtleties arise when we care about the minimal cost as opposed to the average cost. For example, there may not be a well-defined “best” distribution as there is in the standard bandit setting. The best distribution depends on the rewards that have been obtained and on the remaining time horizon. Whereas in the standard bandit setting, it is sensible to compare policies with an oracle which plays the single best arm, in the extreme bandit setting, there are multiple sensible oracle models. We define a sensible notion of “extreme regret” in the extreme bandit setting, which parallels the concept of regret in the standard bandit setting. We then prove that no policy can asymptotically achieve no extreme regret.


1 Introduction 


Our motivation comes from hyperparameter optimization and more generally from the challenge of minimizing a black-box objective $f : \Omega \to [0,1]$ which we can only evaluate pointwise. As an example, $\omega \in \Omega$ may parameterize the architecture of a convolutional network, and $f(\omega)$ may be the validation error when the network with that architecture is trained on a particular data set. A number of approaches have been applied to the optimization of $f$, including Bayesian optimization, covariance matrix adaptation, and other methods (for an incomplete list, see Bergstra and Bengio (2012); Bergstra et al. (2011); Snoek et al. (2012); Hansen (2006); Wang et al. (2013); Lagarias et al. (1998); Powell (2006); Duchi et al. (2015)).

In some sense, random search is the benchmark of choice. Whereas other approaches work well under various and often unverifiable conditions (such as smoothness or convexity of the objective), random search has strong finite-sample guarantees that hold without any assumptions on the function under consideration. This guarantee is illustrated by the so-called rule of 59,$^1$ which states that the best of 59 random samples will be in the best 5 percent of all samples with probability at least 0.95. More generally, any distribution over the set of hyperparameters $\Omega$ induces a distribution $\mu$ over the validation error in $[0,1]$. Let $F_\mu$ be the cumulative distribution function of $\mu$, and suppose that $F_\mu$ is continuous. Suppose that $x_1, \ldots, x_T$ are independent and identically distributed samples from $\mu$ (obtained, for instance, by independently sampling hyperparameters $\omega_t$ and evaluating $x_t = f(\omega_t)$ for $1 \le t \le T$). The following is known.

Lemma 1. The distribution of the extreme cost $\min\{x_1, \ldots, x_T\}$ is easily described with quantiles. We have $\mathbb{P}(F_\mu(\min\{x_1, \ldots, x_T\}) \le \alpha) = 1 - (1 - \alpha)^T$. More specifically, $F_\mu(\min\{x_1, \ldots, x_T\})$ is a Beta$(1, T)$ random variable.

Proof. The event $F_\mu(\min\{x_1, \ldots, x_T\}) > \alpha$ happens if and only if $F_\mu(x_t) > \alpha$ for each $t$, which happens with probability $(1 - \alpha)^T$. Differentiating the resulting cumulative distribution function gives the density function of a Beta$(1, T)$ random variable. □

$^1$ Though they are known, the rule of 59 and Lemma 1 do not appear in Bergstra and Bengio (2012), and they are difficult to find in the literature.

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 41. Copyright 2016 by the authors.
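The rule of 59 and the formula in Lemma 1 can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the simulation assumes uniform costs on [0, 1] so that the cumulative distribution function is the identity.

```python
import random


def prob_min_in_best_fraction(a: float, T: int) -> float:
    """Closed form from Lemma 1: P(F(min of T samples) <= a) = 1 - (1 - a)**T."""
    return 1.0 - (1.0 - a) ** T


def simulate(a: float, T: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate with uniform costs, for which F(x) = x on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        best = min(rng.random() for _ in range(T))
        hits += best <= a
    return hits / trials


rule_of_59 = prob_min_in_best_fraction(0.05, 59)  # just above 0.95
rule_of_58 = prob_min_in_best_fraction(0.05, 58)  # just below 0.95
estimate = simulate(0.05, 59)                     # should be close to rule_of_59
```

With 58 samples the closed form falls slightly below 0.95, which is why 59 is the smallest sample size for which the rule holds.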

The generality of Lemma 1 comes at a price. The guarantee is given with respect to the distribution $\mu$, but there is no guarantee about $\mu$ itself. Different induced distributions $\mu$ may arise from different parameterizations of the hyperparameter space $\Omega$ (for example, from the decision to put a uniform or a log-uniform distribution over a coordinate of $\omega$), and the allocation of mass over $[0,1]$ may vary wildly based on these choices.

Furthermore, the flip side of making no assumptions on the underlying objective is that random search fails to adapt to easy problems. When the objective under consideration satisfies various regularity conditions (as real-world objectives often do), more heavily engineered approaches will likely outperform random search. That said, it is not clear how to know that a given algorithm is outperforming random search without also running random search. For this reason, the benefits of a potentially faster algorithm are blunted when one must also run the slow algorithm to verify the performance of the fast algorithm.

Given the variety of existing hyperparameter optimization algorithms, it would be desirable to devise a strategy for sequentially choosing which algorithm to use in a way that performs nearly as well as if we had only used the single best algorithm. We consider the simpler problem of choosing which of several distributions over hyperparameters to use for random search. In Theorem 11, we show that even in this simplified setting, no strategy guarantees performance that is asymptotically as good as the single best distribution, at least not without stronger assumptions.

We will frame our negative result in the extreme bandit setting (Carpentier and Valko, 2014), also called the max $K$-armed bandit setting (Cicirello and Smith, 2005). Prior work has focused on designing algorithms that perform asymptotically as well as the single best distribution under parametric (or semiparametric) assumptions on the possible distributions (Cicirello and Smith, 2005; Carpentier and Valko, 2014). Instead, we focus on probing the difficulty of the problem, pointing out a number of subtleties that arise in this setting that do not show up in the conventional bandit setting.


2 The Extreme Bandit Setting


Cicirello and Smith (2005) introduce the extreme bandit problem (they call it the max $K$-armed bandit problem) as follows. We are given a tuple of unknown distributions (arms) $\mu_1^K = (\mu_1, \ldots, \mu_K)$. The $k$th distribution generates sample $x_{k,t}$ at time $t$, for integer $t \ge 1$, and all of the samples $x_{k,t}$ are independent. A policy $\pi$ is a function that, at each time $t$, chooses the index $k_t$ of a distribution to sample based on the previously observed samples. That is,

\[ k_t = \pi(\underbrace{k_1, \ldots, k_{t-1}}_{\text{past arm choices}}, \underbrace{x_{k_1,1}, \ldots, x_{k_{t-1},t-1}}_{\text{past values}}). \]

We would like to compare the performance of a policy $\pi$ to that of an oracle policy $\pi^*$ that has access to knowledge of the distributions $\mu_1^K$, so

\[ k_t^* = \pi^*(\mu_1^K, k_1^*, \ldots, k_{t-1}^*, x_{k_1^*,1}, \ldots, x_{k_{t-1}^*,t-1}). \]
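The interaction between a policy and the arms can be sketched in a few lines. This is an illustrative sketch only: the type aliases, the `run` helper, and the round-robin baseline are hypothetical names introduced here, and the two arms (uniform on [0, 1] and a point mass at 1) are chosen purely as an example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Arm = Callable[[random.Random], float]   # draws one cost from an arm
History = List[Tuple[int, float]]        # (arm index, observed cost) pairs
Policy = Callable[[History, int], int]   # (history, number of arms) -> next arm


def round_robin(history: History, K: int) -> int:
    """Hypothetical baseline policy: cycle through the arms in order."""
    return len(history) % K


def run(policy: Policy, arms: Sequence[Arm], T: int, seed: int = 0) -> Tuple[History, float]:
    """Play T rounds and return the history together with the single best cost."""
    rng = random.Random(seed)
    history: History = []
    best = float("inf")
    for _ in range(T):
        k = policy(history, len(arms))   # depends only on past choices and values
        x = arms[k](rng)
        history.append((k, x))
        best = min(best, x)
    return history, best


# Example arms (assumed for illustration): uniform on [0, 1] and a point mass at 1.
arms = [lambda rng: rng.random(), lambda rng: 1.0]
history, best = run(round_robin, arms, T=10)
```

The quantity of interest is `best`, the minimum cost seen so far, rather than the sum of the observed costs.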


Both Cicirello and Smith (2005) and Carpentier and Valko (2014) phrase their results in terms of the maximization of a reward rather than the minimization of a cost. They define the “regret” of policy $\pi$ with respect to the oracle $\pi^*$ over a time horizon of $T$ as

\[ G_T^{\pi,\pi^*} = \mathbb{E}\Big[\max_{t \le T} x_{k_t^*,t}\Big] - \mathbb{E}\Big[\max_{t \le T} x_{k_t,t}\Big]. \]

Under semiparametric assumptions on $\mu_1^K$, Carpentier and Valko (2014) exhibit a policy $\pi$ such that

\[ G_T^{\pi,\pi^*} \text{ is } o\Big(\mathbb{E}\Big[\max_{t \le T} x_{k_t^*,t}\Big]\Big) \tag{1} \]

or equivalently,

\[ \lim_{T \to \infty} \frac{\mathbb{E}\big[\max_{t \le T} x_{k_t,t}\big]}{\mathbb{E}\big[\max_{t \le T} x_{k_t^*,t}\big]} = 1. \tag{2} \]

The result in Equation 1 is superficially similar to results in the standard bandit setting. However, while the condition in Equation 1 is sensible for the setting considered by Carpentier and Valko (2014) (where the distributions $\mu_1^K$ have unbounded support), it is particularly sensitive to the nature of the distributions. For instance, the result in Equation 1 is trivially achieved when the distributions have bounded support (for example, when the support is contained in $[0,1]$ as in hyperparameter optimization). In this case, both the numerator and denominator in Equation 2 converge to the upper bound of the support and $G_T^{\pi,\pi^*} \to 0$ (for any policy that chooses each distribution infinitely often).


Furthermore, the condition in Equation 2 is asymmetric with respect to maximization and minimization. When performing minimization of a cost instead of maximization of a reward (using distributions supported in $[0,1]$), both $\mathbb{E}[\min_{t \le T} x_{k_t,t}]$ and $\mathbb{E}[\min_{t \le T} x_{k_t^*,t}]$ may approach 0, in which case the ratio may exhibit radically different behavior. In Example 2 and Example 3, we demonstrate some of the peculiarities of this performance metric in the minimization setting.

Example 2. Suppose $\mu_1$ is a Bernoulli distribution with mean parameter $0 < p < 1$ and suppose that $\mu_2$ is a point mass on 1. Consider a policy $\pi$ which chooses $\mu_2$ at $t = 1$ and then chooses $\mu_1$ for all $t > 1$ and a policy $\pi^*$ which always chooses $\mu_1$. We have

\[ \lim_{T \to \infty} \frac{\mathbb{E}[\min_{t \le T} x_{k_t,t}]}{\mathbb{E}[\min_{t \le T} x_{k_t^*,t}]} = \lim_{T \to \infty} \frac{p^{T-1}}{p^T} = \frac{1}{p}, \]

which remains bounded away from 1 even though the policy $\pi$ acted optimally at every time step after $t = 1$.

Example 3. Suppose $\mu_1$ is the uniform distribution over $[0,1]$ and suppose that $\mu_2$ is a point mass on 1. Consider a policy $\pi$ which chooses $\mu_2$ at $t = 1$ and then chooses $\mu_1$ for all $t > 1$ and a policy $\pi^*$ which always chooses $\mu_1$. We have

\[ \lim_{T \to \infty} \frac{\mathbb{E}[\min_{t \le T} x_{k_t,t}]}{\mathbb{E}[\min_{t \le T} x_{k_t^*,t}]} = \lim_{T \to \infty} \frac{T^{-1}}{(T+1)^{-1}} = 1. \]

Note above that the minimum of $T$ independent uniform random variables is a Beta$(1, T)$ random variable, which has mean $1/(T+1)$.

Despite the fact that the policy $\pi$ acts optimally at every time step other than $t = 1$ in both Example 2 and Example 3, the ratios of their expectations to that of the oracle $\pi^*$ exhibit wildly different behaviors.
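The two ratios can be evaluated exactly for finite horizons. The sketch below uses hypothetical helper names and relies only on the expectations stated in the examples: for a Bernoulli arm with mean p the expected minimum of n samples is p^n, and for a uniform arm it is 1/(n + 1); the wasted first pull leaves the policy with T - 1 useful samples.

```python
from fractions import Fraction


def bernoulli_ratio(p: Fraction, T: int) -> Fraction:
    """Example 2: E[min] is p**(T-1) for the policy and p**T for the oracle."""
    return p ** (T - 1) / p ** T


def uniform_ratio(T: int) -> Fraction:
    """Example 3: E[min] is 1/T for the policy and 1/(T+1) for the oracle."""
    return Fraction(1, T) / Fraction(1, T + 1)


p = Fraction(1, 4)  # illustrative value of the Bernoulli mean
ratios_ex2 = [bernoulli_ratio(p, T) for T in (2, 10, 100)]  # constant, equal to 1/p
ratios_ex3 = [uniform_ratio(T) for T in (2, 10, 100)]       # (T+1)/T, tends to 1
```

The Bernoulli ratio stays at 1/p for every horizon, while the uniform ratio approaches 1, even though the policy makes the same single mistake in both cases.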

To avoid this sensitivity, we define “extreme regret” as follows.

Definition 4. We define the extreme regret of the policy $\pi$ with respect to the oracle policy $\pi^*$ over a time horizon of $T$ as

\[ R_T^{\pi,\pi^*} = \frac{1}{T} \min\Big\{ T' \ge 1 : \mathbb{E}\Big[\min_{t \le T'} x_{k_t,t}\Big] \le \mathbb{E}\Big[\min_{t \le T} x_{k_t^*,t}\Big] \Big\}. \]

Note that $R_T^{\pi,\pi^*}$ depends on the tuple of distributions $\mu_1^K$, but we suppress this dependence in our notation.

The quantity $R_T^{\pi,\pi^*}$ is essentially the ratio of the time horizons $T'$ to $T$ over which the policy $\pi$ and the oracle $\pi^*$ perform equally well. This definition is sensible regardless of whether the samples are bounded or unbounded, whether we care about minimization or maximization, and regardless of how we scale or translate the distributions. Note that in both Example 2 and Example 3, we have $R_T^{\pi,\pi^*} \to 1$. Despite its apparent difference, as we discuss in Section 2.1, Definition 4 is closely related to the notion of regret used in the standard bandit setting.
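Definition 4 can be evaluated directly when the expected minimum costs are available in closed form. The sketch below is an illustration under that assumption; `extreme_regret` and its `max_horizon` cutoff are hypothetical names, and the cost functions are the ones from Example 2 and Example 3.

```python
from fractions import Fraction
from typing import Callable


def extreme_regret(policy_cost: Callable[[int], Fraction],
                   oracle_cost: Callable[[int], Fraction],
                   T: int, max_horizon: int = 10 ** 6) -> Fraction:
    """(1/T) * min{T' >= 1 : policy_cost(T') <= oracle_cost(T)}, as in Definition 4."""
    target = oracle_cost(T)
    for T_prime in range(1, max_horizon + 1):
        if policy_cost(T_prime) <= target:
            return Fraction(T_prime, T)
    raise ValueError("policy did not match the oracle within max_horizon")


p = Fraction(1, 4)  # illustrative Bernoulli mean
# Example 2: Bernoulli arm, one wasted pull at t = 1.
r2 = extreme_regret(lambda n: p ** (n - 1), lambda n: p ** n, T=50)
# Example 3: uniform arm, one wasted pull at t = 1.
r3 = extreme_regret(lambda n: Fraction(1, n), lambda n: Fraction(1, n + 1), T=50)
```

In both examples the policy needs exactly one extra sample to catch up with the oracle, so the extreme regret is (T + 1)/T, which tends to 1.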


Definition 5. We say that policy $\pi$ achieves “no extreme regret” with respect to the oracle $\pi^*$ if $\limsup_T R_T^{\pi,\pi^*} \le 1$ for all tuples of distributions $\mu_1^K$.

Definition 5 is fairly lenient. Had we defined “no extreme regret” using the condition given in Equation 1, our main result in Theorem 11 could have been made even stronger, but we view that as undesirable as illustrated by Example 2 and Example 3. Moreover, the quantities in Definition 4 and Definition 5 closely parallel quantities of interest in the standard bandit setting, as we show in Section 2.1.


2.1 Analogy with the Standard Bandit Setting


Definition 4 and Definition 5 parallel the intuition of the standard bandit setting, which (when minimizing a cost) studies the rate of convergence of

\[ \frac{\mathbb{E}\big[\sum_{t=1}^{T} x_{k_t,t}\big] - \min_k \mathbb{E}\big[\sum_{t=1}^{T} x_{k,t}\big]}{\min_k \mathbb{E}\big[\sum_{t=1}^{T} x_{k,t}\big]}. \tag{3} \]

Adding 1 to this quantity, this is the same as studying the rate of convergence of

\[ \frac{\mathbb{E}\big[\sum_{t=1}^{T} x_{k_t,t}\big]}{T \min_k \mathbb{E}[x_{k,t}]}. \]

Now, observe that we have

\begin{align}
\frac{\mathbb{E}\big[\sum_{t=1}^{T} x_{k_t,t}\big]}{T \min_k \mathbb{E}[x_{k,t}]}
&\approx \frac{1}{T} \min\Big\{ T' \ge 1 : \frac{\mathbb{E}\big[\sum_{t=1}^{T} x_{k_t,t}\big]}{\min_k \mathbb{E}[x_{k,t}]} \le T' \Big\} \tag{4} \\
&= \frac{1}{T} \min\Big\{ T' \ge 1 : \mathbb{E}\Big[\sum_{t=1}^{T} x_{k_t,t}\Big] \le \min_k \mathbb{E}\Big[\sum_{t=1}^{T'} x_{k,t}\Big] \Big\}, \notag
\end{align}

which is essentially the ratio of the time horizons over which the policy and the oracle perform equally well. The two sides of the approximate equality in Equation 4 differ by at most $1/T$. In the standard bandit setting, the term “regret” often refers to the numerator in Equation 3 and not the quantity in Equation 4. However, as the above computation shows, the two quantities are closely related, and they capture the same phenomenon. We will phrase our results in terms of the quantity $R_T^{\pi,\pi^*}$ from Definition 4, which parallels the quantity in Equation 4.
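The claim that the two sides of the approximate equality differ by at most 1/T follows from rounding a real number up to the next integer. A small numerical sketch, with hypothetical function names and illustrative numbers that are not taken from the paper:

```python
import math
from fractions import Fraction


def ratio_form(expected_total_cost: Fraction, best_mean: Fraction, T: int) -> Fraction:
    """E[sum of the policy's costs] / (T * min_k E[x_k])."""
    return expected_total_cost / (T * best_mean)


def horizon_form(expected_total_cost: Fraction, best_mean: Fraction, T: int) -> Fraction:
    """(1/T) * min{T' >= 1 : E[sum of the policy's costs] <= T' * min_k E[x_k]}."""
    T_prime = max(1, math.ceil(expected_total_cost / best_mean))
    return Fraction(T_prime, T)


# Illustrative values: horizon 100, best arm mean 3/10, policy's expected total cost 37.
T = 100
best_mean = Fraction(3, 10)
expected_total_cost = Fraction(37)
gap = horizon_form(expected_total_cost, best_mean, T) - ratio_form(expected_total_cost, best_mean, T)
```

Here the ratio form is 37/30 and the horizon form is 124/100, so the gap is 1/150, below 1/T = 1/100.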


3 Oracle Models 

In the standard multi-armed bandit setting, if an oracle with knowledge of the distributions of the arms seeks to minimize the expected sum of the losses, it should simply choose to play the arm with the lowest mean. This is true regardless of the time horizon. By analogy with the usual multi-armed bandit setting, Cicirello and Smith (2005) and Carpentier and Valko (2014) both consider the oracle policy in Definition 6 that plays the single “best” arm.

Definition 6 (single-armed oracle). The single-armed oracle is the oracle which, over a time horizon of $T$, plays the single best arm

\[ \arg\min_k \mathbb{E}\Big[\min_{t \le T} x_{k,t}\Big]. \]

The single-armed oracle provides a good benchmark for comparison, but it is not the optimal oracle policy. When the time horizon is known in advance, the optimal oracle policy is given in Definition 7.

Definition 7 (optimal oracle). The optimal oracle over a time horizon of $T$ plays the policy that solves

\[ \arg\min_\pi \mathbb{E}\Big[\min_{t \le T} x_{k_t,t}\Big]. \]

When the time horizon is not known in advance, one possible oracle strategy is a greedy strategy given in Definition 8.

Definition 8 (greedy oracle). The greedy oracle chooses the arm $k_t^*$ at time $t$ that gives the maximal expected improvement over the current best value $y_{t-1} = \min_{s \le t-1} x_{k_s^*,s}$. That is,

\[ k_t^* = \arg\min_k \mathbb{E}\Big[\min\{x_{k,t}, y_{t-1}\} \,\Big|\, x_{k_1^*,1}, \ldots, x_{k_{t-1}^*,t-1}\Big]. \]


Unlike the greedy oracle, both the single-armed oracle and the optimal oracle require knowledge of the time horizon. Indeed, as shown in Example 9, the notion of a “best” arm is not well-defined outside of a specific time horizon. The best arm depends on the time horizon. This point contrasts sharply with the usual multi-armed bandit setting.
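For arms with finitely many possible cost values, the greedy rule of Definition 8 reduces to comparing expected values of the minimum with the current best. The sketch below is an illustration with hypothetical names (`Dist`, `expected_min_with`, `greedy_choice`); it uses the two arms that appear in Example 10 and assumes the current best starts at 1 before any sample is drawn, since costs lie in [0, 1].

```python
from fractions import Fraction
from typing import Dict, List

Dist = Dict[Fraction, Fraction]  # maps a cost value to its probability


def expected_min_with(dist: Dist, current_best: Fraction) -> Fraction:
    """E[min{x, y}] for x drawn from dist and a fixed current best value y."""
    return sum(prob * min(value, current_best) for value, prob in dist.items())


def greedy_choice(arms: List[Dist], current_best: Fraction) -> int:
    """Index of the arm minimizing E[min{x_k, y}] (first index on ties)."""
    scores = [expected_min_with(dist, current_best) for dist in arms]
    return scores.index(min(scores))


arms = [
    {Fraction(1, 2): Fraction(1)},                           # deterministic cost 1/2
    {Fraction(0): Fraction(1, 4), Fraction(1): Fraction(3, 4)},  # 0 w.p. 1/4, 1 w.p. 3/4
]
first = greedy_choice(arms, Fraction(1))       # before any sample (assumed start value 1)
second = greedy_choice(arms, Fraction(1, 2))   # after observing the cost 1/2
```

With these arms the greedy rule first plays the deterministic arm and then switches to the risky arm, so its choice depends on the values already obtained.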

Example 9. Suppose we have an infinite collection of arms $\mu_s$ indexed by $0 < s < 1$. Let $x_{s,t}$ be a sample from $\mu_s$ and suppose that $\mathbb{P}(x_{s,t} = s) = s$ and $\mathbb{P}(x_{s,t} = 1) = 1 - s$. Then the optimal $s$ is $\Theta((\log T)/T)$.
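The dependence of the best arm on the horizon in Example 9 can be made concrete with a short calculation, offered here as a sketch rather than as the argument from the appendix. Playing arm s for T rounds gives expected minimum s(1 - (1 - s)^T) + (1 - s)^T = s + (1 - s)^(T+1), which is convex in s; setting the derivative to zero gives s = 1 - (T + 1)^(-1/T). The function names below are hypothetical.

```python
import math


def expected_min(s: float, T: int) -> float:
    """E[min of T samples from arm s] = s + (1 - s)**(T + 1)."""
    return s + (1.0 - s) ** (T + 1)


def best_s(T: int) -> float:
    """Minimizer of expected_min over s, from setting the derivative to zero."""
    return 1.0 - (T + 1) ** (-1.0 / T)


# Ratio of the minimizer to (log T)/T for a few horizons.
ratios = {T: best_s(T) * T / math.log(T) for T in (10, 1000, 100000)}
```

The ratios stay close to 1, consistent with the stated order of (log T)/T, and the minimizer shrinks as the horizon grows, so no single arm is best for every horizon.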

We elaborate on Example 9 in Appendix A. One difference between the single-armed oracle and the optimal oracle is that the optimal oracle can adapt its strategy based on the samples that it receives, whereas the single-armed oracle is non-adaptive. Its strategy is fixed ahead of time. Example 10 shows that the single-armed oracle is not even the optimal non-adaptive oracle. A mixed strategy may outperform any policy that plays only a single arm.


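Example 10. Consider a time horizon $T = 2$ and consider two arms. Suppose that samples $x_{1,t}$ from $\mu_1$ deterministically equal $1/2$ and that samples $x_{2,t}$ from $\mu_2$ satisfy $\mathbb{P}(x_{2,t} = 0) = 1/4$ and $\mathbb{P}(x_{2,t} = 1) = 3/4$. Then

\begin{align*}
\mathbb{E}\Big[\min_{1 \le t \le 2} x_{1,t}\Big] &= \frac{1}{2}, \\
\mathbb{E}\Big[\min_{1 \le t \le 2} x_{2,t}\Big] &= \frac{9}{16}, \\
\mathbb{E}\big[\min\{x_{1,1}, x_{2,2}\}\big] &= \frac{3}{8}.
\end{align*}

This example shows that a fixed strategy that plays both arms can outperform any policy that plays a single arm.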
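The three expectations in Example 10 can be reproduced by enumerating the joint outcomes of the two pulls. The helper name `expected_min` is hypothetical; the arm distributions are exactly those stated in the example.

```python
from fractions import Fraction
from itertools import product

arm1 = {Fraction(1, 2): Fraction(1)}                               # always 1/2
arm2 = {Fraction(0): Fraction(1, 4), Fraction(1): Fraction(3, 4)}  # 0 w.p. 1/4, 1 w.p. 3/4


def expected_min(pulls):
    """Exact E[min] over independent pulls, each given as a {value: probability} dict."""
    total = Fraction(0)
    for outcome in product(*(dist.items() for dist in pulls)):
        prob = Fraction(1)
        for _, p in outcome:
            prob *= p
        total += prob * min(value for value, _ in outcome)
    return total


both_arm1 = expected_min([arm1, arm1])  # single-arm strategy on arm 1
both_arm2 = expected_min([arm2, arm2])  # single-arm strategy on arm 2
one_each = expected_min([arm1, arm2])   # fixed strategy that plays each arm once
```

Playing each arm once gives 3/8, which is smaller than both 1/2 and 9/16.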


We described three different oracle models above. One caveat is that in the event that there is a well-defined best arm, that is, some arm $k^*$ such that $\mathbb{P}(x_{k^*,t} \le a) \ge \mathbb{P}(x_{k,t} \le a)$ for all $k$ and all $0 \le a \le 1$, then these three oracles all coincide and we need not worry about which oracle to use for comparison. This is roughly the case in prior work. Cicirello and Smith (2005) and Carpentier and Valko (2014) make (semi)parametric assumptions on the distributions of the arms which essentially restrict the setting to have a well-defined best arm.


Despite the fact that the single-armed oracle is not the optimal oracle strategy, it is often a sufficiently strong baseline for measuring the performance of our policies. When we cannot even do as well as the single-armed oracle, as will be the case in Theorem 11, then we also cannot do as well as the optimal oracle. For the remainder of the paper, we will compare to the single-armed oracle. However, the results necessarily hold for comparisons to the optimal oracle as well.


4 Main Result 

Theorem 11 shows that no policy can be guaranteed to perform asymptotically as well as the single best distribution. That is, it is impossible to achieve “no extreme regret” in the extreme bandit problem. This result contrasts sharply with results in the standard bandit setting, where it is possible to achieve no regret under relatively mild conditions on the distributions $\mu_1^K = (\mu_1, \ldots, \mu_K)$.

Theorem 11. For any policy $\pi$, there exist distributions $\mu_1^K$ such that $\limsup_T R_T^{\pi,\pi^*} \ge K$, where $\pi^*$ is the single-armed oracle.

We prove Theorem 11 in Section 4.3. The main components of the proof are Lemma 13, which upper bounds the performance of the single-armed oracle, and Lemma 15, which lower bounds the performance of the policy $\pi$.

Robert Nishihara, David Lopez-Paz, Leon Bottou

This result shows that the extreme bandit problem is fundamentally different from the standard multi-armed bandit problem, where a variety of policies perform asymptotically as well as the single best arm. Indeed, in the standard bandit problem, the arms are primarily characterized by their means, and so it suffices to estimate the means of the arms and play the best one. However, as discussed in Example 9, there is no well-defined best arm in the extreme bandit problem. Our construction will create a situation where the “best” arm periodically switches among the $K$ distributions so that the policy $\pi$ often ends up choosing the “wrong” arm.

For i > 1, let cc, = (8A")“( l! ) . Our construction will 
involve a sum of point masses at the values $\alpha_i$. It is easily verified that the sequence $\alpha_i$ satisfies the conditions in Lemma 12.

Lemma 12. The sequence $\alpha_i$ satisfies the following properties.

(A) $\sum_{i=1}^{\infty} \alpha_i \le 1/2$

(B) $\alpha_i \le 4^{-(1+i)}$

(C) $\sum_{j=i+1}^{\infty} \alpha_j \le \alpha_i/(iK)$

(D) $\alpha_i \le \alpha_{i-1}^{i}\, 2^{-i}$.

Henceforth, we will not need the exact values of the sequence; we will only need the properties enumerated in Lemma 12. For $b = (b_1, b_2, \ldots) \in \{1, \ldots, K\}^{\infty}$, define the tuple of distributions $\mu_1^K(b) = (\mu_1(b), \ldots, \mu_K(b))$ via

\[ \mu_k(b) = \gamma_k(b)\,\delta_1 + \sum_{i=1}^{\infty} \mathbb{1}[b_i = k]\, \alpha_i\, \delta_{\alpha_i}, \]

where

\[ \gamma_k(b) = 1 - \sum_{i=1}^{\infty} \mathbb{1}[b_i = k]\, \alpha_i. \]

Here, $\delta_c$ represents a point mass at $c$, $\mathbb{1}[\mathcal{E}]$ is the $\{0,1\}$-indicator function of the event $\mathcal{E}$, and $\gamma_k(b)$ is chosen to make $\mu_k(b)$ a probability measure. Let $\mathcal{M}_K$ be the set of tuples of distributions that can be obtained in this way. The value $b_i$ simply assigns the point mass $\delta_{\alpha_i}$ to one of the $K$ distributions. We let $D$ denote the distribution over the set $\{1, \ldots, K\}^{\infty}$ defined so that the $b_i$'s are independent uniform random variables in $\{1, \ldots, K\}$.

Define the time horizon $T_i = \lceil \log(1/\alpha_i)/\alpha_i \rceil$. Instead of controlling $R_T^{\pi,\pi^*}$ for every $T$, we will control the quantity specifically for the time horizons $T_i$. In our construction, the $b_i$th arm in the tuple will be the best arm over the time horizon $T_i$, and the other arms will be substantially worse. We will show that, for a fixed $i$, we can construct a tuple $\mu_1^K$ so that the policy $\pi$ takes roughly $K$ times longer than the single-armed oracle $\pi^*$ to obtain the value $\alpha_i$ (that is, $\pi^*$ requires roughly $T_i$ samples and $\pi$ requires roughly $T_i' \approx K T_i$ samples). Using the probabilistic method, we will then show that we can find a tuple $\mu_1^K$ so that the policy takes roughly $K$ times longer than the oracle to obtain the value $\alpha_i$ for infinitely many values of $i$.


4.1 Upper Bound on Oracle Performance


We begin by giving an upper bound on the performance of the oracle policy that plays the single best arm over the time horizon $T_i$. This bound holds uniformly over $\mathcal{M}_K$.

Lemma 13. Suppose that $\mu_1^K(b) \in \mathcal{M}_K$. If $\pi^*$ is the single-armed oracle from Definition 6, then

\[ \mathbb{E}\Big[\min_{t \le T_i} x_{k^*,t}\Big] \le 2\alpha_i. \]


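Proof. Recall that $b_i$ is the index of the distribution that has a point mass at $\alpha_i$. We have

\[ \mathbb{E}\Big[\min_{t \le T_i} x_{k^*,t}\Big] = \min_k \mathbb{E}\Big[\min_{t \le T_i} x_{k,t}\Big] \le \mathbb{E}\Big[\min_{t \le T_i} x_{b_i,t}\Big]. \]

The term on the right-hand side can be rewritten as

\begin{align*}
&\mathbb{E}\Big[\mathbb{1}\Big[\min_{t \le T_i} x_{b_i,t} \le \alpha_i\Big] \min_{t \le T_i} x_{b_i,t}\Big] + \mathbb{E}\Big[\mathbb{1}\Big[\min_{t \le T_i} x_{b_i,t} > \alpha_i\Big] \min_{t \le T_i} x_{b_i,t}\Big] \\
&\qquad \le \alpha_i\, \mathbb{P}\Big(\min_{t \le T_i} x_{b_i,t} \le \alpha_i\Big) + \mathbb{P}\Big(\min_{t \le T_i} x_{b_i,t} > \alpha_i\Big) \\
&\qquad \le \alpha_i + \mathbb{P}\Big(\min_{t \le T_i} x_{b_i,t} > \alpha_i\Big).
\end{align*}

The first inequality follows by upper bounding the term $\min_{t \le T_i} x_{b_i,t}$ by $\alpha_i$ in the first term and by 1 in the second term. The second inequality follows by upper bounding the first probability by 1. To finish the lemma, note that

\[ \mathbb{P}\Big(\min_{t \le T_i} x_{b_i,t} > \alpha_i\Big) \le (1 - \alpha_i)^{T_i} \le e^{-\alpha_i T_i} \le \alpha_i, \]

where the third inequality uses the definition $T_i = \lceil \log(1/\alpha_i)/\alpha_i \rceil$. □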
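The last chain of inequalities in the proof can be checked numerically. The sketch below uses hypothetical function names and a few illustrative values of alpha (not the actual sequence from the construction); it only relies on the definition of the horizon as the ceiling of log(1/alpha)/alpha.

```python
import math


def oracle_horizon(alpha: float) -> int:
    """T = ceil(log(1/alpha) / alpha), the horizon used in Lemma 13."""
    return math.ceil(math.log(1.0 / alpha) / alpha)


def lemma13_chain(alpha: float):
    """Return ((1 - alpha)**T, exp(-alpha * T), alpha) for T = oracle_horizon(alpha)."""
    T = oracle_horizon(alpha)
    miss = (1.0 - alpha) ** T  # chance that T draws all miss a value of mass alpha
    return miss, math.exp(-alpha * T), alpha


# Illustrative values of alpha only.
checks = {alpha: lemma13_chain(alpha) for alpha in (0.1, 0.01, 1e-4)}
```

For each value the three numbers are nondecreasing, so the expected minimum is at most alpha plus the miss probability, which is at most twice alpha.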


4.2 Lower Bound on Performance of $\pi$

Here, we give a lower bound on the performance of any fixed policy $\pi$, when averaged over a collection of tuples of distributions.

Define the time horizon $T_i' = \lfloor c_i K \log(1/\alpha_i)/\alpha_i \rfloor$, where $c_i = (1 - 1/i)/((1 + 1/i)^2 + 2/i)$. The constant $c_i$ is a correction term that converges to 1 as $i \to \infty$. Its specific value is not meaningful. The goal of this section is roughly to show that the performance of the policy $\pi$ over a time horizon of $T_i'$ is comparable to the performance of the oracle policy over a time horizon of $T_i$.
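The behavior of the correction term and of the ratio between the two horizons can be illustrated numerically. In the sketch below the function names are hypothetical and `alpha` is a free illustrative parameter rather than the actual value from the construction; only the formulas for the correction term and the two horizons are taken from the text.

```python
import math
from fractions import Fraction


def c(i: int) -> Fraction:
    """Correction term c_i = (1 - 1/i) / ((1 + 1/i)**2 + 2/i)."""
    r = Fraction(1, i)
    return (1 - r) / ((1 + r) ** 2 + 2 * r)


def horizons(alpha: float, i: int, K: int):
    """Return (T, T') with T = ceil(L/alpha), T' = floor(c_i*K*L/alpha), L = log(1/alpha)."""
    base = math.log(1.0 / alpha) / alpha
    return math.ceil(base), math.floor(float(c(i)) * K * base)


# Illustrative parameters only: alpha = 1e-6, index i = 1000, K = 5 arms.
T, T_prime = horizons(alpha=1e-6, i=1000, K=5)
ratio = T_prime / T
```

The correction term starts at 0 for i = 1 and approaches 1 slowly, so the ratio of the horizons approaches the number of arms from below.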

Throughout this section, we will fix an index $i$ and we fix $b_j$ for all $j \ne i$. Then we define the sequence $b^{k'} = (b_1^{k'}, b_2^{k'}, \ldots)$ via $b_j^{k'} = b_j$ for $j \ne i$ and $b_i^{k'} = k'$. The $K$ tuples $\mu_1^K(b^{k'})$ for different values of $k'$ are identical in all respects except for the index of the distribution that possesses the point mass $\delta_{\alpha_i}$ and the amount of mass $\gamma_k(b^{k'})$ that the $k$th distribution in the $k'$th tuple assigns to $\delta_1$.

Define the tuple of distributions $\eta_1^K(b) = (\eta_1(b), \ldots, \eta_K(b))$ by $\eta_k(b) = \frac{1}{K} \sum_{k'=1}^{K} \mu_k(b^{k'})$. Let $\bar\gamma_k(b) := \frac{1}{K} \sum_{k'=1}^{K} \gamma_k(b^{k'})$ denote the probability that $\eta_k(b)$ assigns to the value 1. The tuple $\eta_1^K(b)$ is the average of the tuples $\mu_1^K(b^{k'})$ over the different values of $k'$.

We begin with Lemma 14, which compares the probability that policy $\pi$ obtains the value $\alpha_i$ when averaged over the tuples $\mu_1^K(b^{k'})$ with the probability that $\pi$ obtains the value $\alpha_i$ in the tuple $\eta_1^K(b)$. This comparison is helpful because each distribution in the tuple $\eta_1^K(b)$ assigns the same mass of $\alpha_i/K$ to $\alpha_i$, and so the probability that $\pi$ fails to obtain $\alpha_i$ when run on the tuple $\eta_1^K(b)$ does not depend on $\pi$ (it is simply $(1 - \alpha_i/K)^{T_i'}$, where $T_i'$ is the time horizon). Of course, as stated, we are actually concerned with the probability that $\pi$ obtains a value less than or equal to $\alpha_i$, but because of Lemma 12(C), the contribution of the smaller terms will not be too great.

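Lemma 14. We have

\[ \frac{1}{K} \sum_{k'=1}^{K} \mathbb{P}\Big(\min_{t \le T_i'} x_{k_t,t} \ge \alpha_{i-1} \,\Big|\, \mu_1^K(b^{k'})\Big) \ge c\, \mathbb{P}\Big(\min_{t \le T_i'} x_{k_t,t} \ge \alpha_{i-1} \,\Big|\, \eta_1^K(b)\Big), \]

where $c = e^{-2\alpha_i T_i'/(iK)}$. In our notation, we condition on $\mu_1^K(b^{k'})$ to indicate the tuple of distributions being used.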


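Proof. Define $S(\pi, \mu_1^K, T)$ to be the set of actions and values that can be obtained by following policy $\pi$ on the tuple $\mu_1^K$ for a time horizon of $T$. That is,

\[ S(\pi, \mu_1^K, T) = \Big\{ (k_t, x_t)_{t=1}^{T} : k_t = \pi(k_1, \ldots, k_{t-1}, x_1, \ldots, x_{t-1}),\; x_t \in \operatorname{supp}(\mu_{k_t}) \Big\}, \]

where $\operatorname{supp}(\mu_{k_t})$ is the support of the distribution $\mu_{k_t}$. Then define $S(\pi, \mu_1^K, T, i)$ to be the subset of $S(\pi, \mu_1^K, T)$ such that all values are greater than or equal to $\alpha_{i-1}$. That is,

\[ S(\pi, \mu_1^K, T, i) = \big\{ (k_t, x_t)_{t=1}^{T} \in S(\pi, \mu_1^K, T) : x_t \ge \alpha_{i-1} \big\}. \]

Critically, note that

\begin{align}
S(\pi, \eta_1^K(b), T_i', i) &= S(\pi, \mu_1^K(b^1), T_i', i) \notag \\
&= \cdots = S(\pi, \mu_1^K(b^K), T_i', i). \tag{5}
\end{align}

Equation 5 holds because the supports of the tuples $\mu_1^K(b^{k'})$ and $\eta_1^K(b)$ only differ on $\alpha_i$, but we are considering only values that are at least $\alpha_{i-1}$, so this difference does not affect the sets. We shall refer to this common set as $S$. We have

\begin{align}
\mathbb{P}\Big(\min_{t \le T_i'} x_{k_t,t} \ge \alpha_{i-1} \,\Big|\, \mu_1^K(b^{k'})\Big)
&= \sum_{S} \Big( \prod_{j=1}^{i-1} \alpha_j^{|\{t \,:\, x_t = \alpha_j\}|} \Big) \tag{6} \\
&\qquad \times \prod_{k=1}^{K} \gamma_k(b^{k'})^{|\{t \,:\, k_t = k,\, x_t = 1\}|}. \tag{7}
\end{align}

It follows that

\begin{align}
&\frac{1}{K} \sum_{k'=1}^{K} \mathbb{P}\Big(\min_{t \le T_i'} x_{k_t,t} \ge \alpha_{i-1} \,\Big|\, \mu_1^K(b^{k'})\Big) \notag \\
&\qquad = \frac{1}{K} \sum_{k'=1}^{K} \sum_{S} \Big( \prod_{j=1}^{i-1} \alpha_j^{|\{t \,:\, x_t = \alpha_j\}|} \Big) \prod_{k=1}^{K} \gamma_k(b^{k'})^{|\{t \,:\, k_t = k,\, x_t = 1\}|} \notag \\
&\qquad = \sum_{S} \Big( \prod_{j=1}^{i-1} \alpha_j^{|\{t \,:\, x_t = \alpha_j\}|} \Big) \Big( \frac{1}{K} \sum_{k'=1}^{K} \prod_{k=1}^{K} \gamma_k(b^{k'})^{|\{t \,:\, k_t = k,\, x_t = 1\}|} \Big), \tag{8}
\end{align}

where the first equality uses Equation 6 and the second equality simply rearranges the terms. We would like to essentially apply Jensen's inequality to say something like

\[ \frac{1}{K} \sum_{k'=1}^{K} \prod_{k=1}^{K} \gamma_k(b^{k'})^{|\{t \,:\, k_t = k,\, x_t = 1\}|} \ge \prod_{k=1}^{K} \bar\gamma_k(b)^{|\{t \,:\, k_t = k,\, x_t = 1\}|}. \tag{9} \]

Unfortunately, despite the fact that each power of $\gamma_k$ is convex on the relevant region, the product over $k = 1, \ldots, K$ of these powers is not quite convex. However, it is nearly convex, and as we show in Lemma 17, Equation 9 holds up to a correction factor of $e^{-2\alpha_i T_i'/(iK)}$.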
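Using this result in Equation 8 gives

\begin{align*}
\frac{1}{K} \sum_{k'=1}^{K} \mathbb{P}\Big(\min_{t \le T_i'} x_{k_t,t} \ge \alpha_{i-1} \,\Big|\, \mu_1^K(b^{k'})\Big)
&\ge e^{-2\alpha_i T_i'/(iK)} \sum_{S} \Big( \prod_{j=1}^{i-1} \alpha_j^{|\{t \,:\, x_t = \alpha_j\}|} \Big) \prod_{k=1}^{K} \bar\gamma_k(b)^{|\{t \,:\, k_t = k,\, x_t = 1\}|} \\
&= e^{-2\alpha_i T_i'/(iK)}\, \mathbb{P}\Big(\min_{t \le T_i'} x_{k_t,t} \ge \alpha_{i-1} \,\Big|\, \eta_1^K(b)\Big),
\end{align*}

where the first inequality uses Lemma 17 and the last equality holds for the same reason that Equation 6 holds. □

In Lemma 15, we turn the bound in Lemma 14 on the probability of obtaining $\alpha_i$ into a bound on the performance of $\pi$. Note that Lemma 14 holds uniformly over the values of $b_j$ for $j \ne i$.

Lemma 15. We have

\[ \frac{1}{K} \sum_{k'=1}^{K} \mathbb{E}\Big[\min_{t \le T_i'} x_{k_t,t} \,\Big|\, \mu_1^K(b^{k'})\Big] \ge 2\alpha_i. \]

Proof. We have

\begin{align}
\frac{1}{K} \sum_{k'=1}^{K} \mathbb{E}\Big[\min_{t \le T_i'} x_{k_t,t} \,\Big|\, \mu_1^K(b^{k'})\Big]
&\ge \frac{\alpha_{i-1}}{K} \sum_{k'=1}^{K} \mathbb{P}\Big(\min_{t \le T_i'} x_{k_t,t} \ge \alpha_{i-1} \,\Big|\, \mu_1^K(b^{k'})\Big) \notag \\
&\ge \alpha_{i-1}\, e^{-2\alpha_i T_i'/(iK)}\, \mathbb{P}\Big(\min_{t \le T_i'} x_{k_t,t} \ge \alpha_{i-1} \,\Big|\, \eta_1^K(b)\Big). \tag{10}
\end{align}

The first inequality is Markov's inequality. The second inequality is Lemma 14. We have

\begin{align}
\mathbb{P}\Big(\min_{t \le T_i'} x_{k_t,t} \ge \alpha_{i-1} \,\Big|\, \eta_1^K(b)\Big)
&\ge \Big(1 - \frac{\alpha_i}{K} - \sum_{j=i+1}^{\infty} \alpha_j\Big)^{T_i'} \notag \\
&\ge \Big(1 - \Big(1 + \frac{1}{i}\Big) \frac{\alpha_i}{K}\Big)^{T_i'} \notag \\
&\ge e^{-\alpha_i (1 + 1/i)^2 T_i'/K} \notag \\
&\ge \alpha_i^{c_i (1 + 1/i)^2}. \tag{11}
\end{align}

The first inequality bounds, at every iteration, the probability of obtaining a value of $\alpha_i$ or less. The second inequality uses Lemma 12(C). The third inequality uses Lemma 12(B) together with the elementary bound $1 - x \ge e^{-(1 + 1/i)x}$, which holds for sufficiently small $x \ge 0$. The fourth inequality uses the definition $T_i' = \lfloor c_i K \log(1/\alpha_i)/\alpha_i \rfloor$.

Combining Equation 10 and Equation 11 gives

\begin{align*}
\frac{1}{K} \sum_{k'=1}^{K} \mathbb{E}\Big[\min_{t \le T_i'} x_{k_t,t} \,\Big|\, \mu_1^K(b^{k'})\Big]
&\ge \alpha_{i-1}\, e^{-2\alpha_i T_i'/(iK)}\, \alpha_i^{c_i (1 + 1/i)^2} \\
&\ge 2\alpha_i^{1/i}\, \alpha_i^{2c_i/i}\, \alpha_i^{c_i (1 + 1/i)^2} \\
&= 2\alpha_i.
\end{align*}

The second inequality uses Lemma 12(D) and the definition of $T_i'$. The third line uses the definition $c_i = (1 - 1/i)/((1 + 1/i)^2 + 2/i)$, which was chosen to make the third line hold. This completes the proof of the lemma. □

Noting that Lemma 15 holds uniformly over the values of $b_j$ for $j \ne i$, a direct consequence of Lemma 15 is Corollary 16.

Corollary 16. We have

\[ \mathbb{P}_{b \sim D}\Big( \mathbb{E}\Big[\min_{t \le T_i'} x_{k_t,t} \,\Big|\, \mu_1^K(b)\Big] \ge 2\alpha_i \Big) \ge \frac{1}{K}, \]

where $D$ is the distribution over $\{1, \ldots, K\}^{\infty}$ defined by sampling each component independently and uniformly at random from $\{1, \ldots, K\}$. The outer probability is over $b$, and the inner expectation is over the $x_{k_t,t}$.


4.3 Proof of Theorem 11

Here we synthesize the above results to prove Theorem 11. Lemma 13 and Corollary 16 together imply that

\[ \mathbb{P}_{b \sim D}\Big( \mathbb{E}\Big[\min_{t \le T_i'} x_{k_t,t} \,\Big|\, \mu_1^K(b)\Big] \ge 2\alpha_i \ge \mathbb{E}\Big[\min_{t \le T_i} x_{k^*,t} \,\Big|\, \mu_1^K(b)\Big] \Big) \ge \frac{1}{K}, \]

which directly implies that $\mathbb{P}_{b \sim D}(R_{T_i}^{\pi,\pi^*} \ge T_i'/T_i) \ge 1/K$. Recall that for a sequence of events $A_i$, we have $\mathbb{P}(\text{infinitely many } A_i \text{ happen}) \ge \limsup_i \mathbb{P}(A_i)$. This can be seen by applying Fatou's lemma to the relevant indicator functions. It follows that

\[ \mathbb{P}_{b \sim D}\Big( R_{T_i}^{\pi,\pi^*} \ge \frac{T_i'}{T_i} \text{ for infinitely many } i \Big) \ge \frac{1}{K}. \]

Recall the definitions

\[ T_i = \lceil \log(1/\alpha_i)/\alpha_i \rceil, \qquad T_i' = \lfloor c_i K \log(1/\alpha_i)/\alpha_i \rfloor. \]

Since $c_i \to 1$, it follows that $T_i'/T_i \to K$, and so there exists a tuple $\mu_1^K \in \mathcal{M}_K$ such that $\limsup_T R_T^{\pi,\pi^*} \ge K$, proving the claim.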
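As a sanity check on the constants, the remark in the proof of Lemma 15 that the correction term was chosen to make the final line hold amounts to the exponent identity 1/i + c_i (2/i + (1 + 1/i)^2) = 1. The sketch below verifies this identity in exact arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def c(i: int) -> Fraction:
    """Correction term c_i = (1 - 1/i) / ((1 + 1/i)**2 + 2/i)."""
    r = Fraction(1, i)
    return (1 - r) / ((1 + r) ** 2 + 2 * r)


def total_exponent(i: int) -> Fraction:
    """Sum of the exponents of alpha_i: 1/i + 2*c_i/i + c_i*(1 + 1/i)**2."""
    r = Fraction(1, i)
    return r + c(i) * (2 * r + (1 + r) ** 2)


exponents = [total_exponent(i) for i in range(1, 200)]
```

Every entry equals exactly 1, so the three powers of the point-mass value multiply to the value itself, leaving the factor of 2.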
5 Related Work


Our setting is closely related to the multi-armed bandit problem, which has been studied extensively. See Bubeck and Cesa-Bianchi (2012) for a survey. Regret is the most common measure of performance, though some authors study “simple regret” (Bubeck et al., 2011), where the goal is to identify the arm with the lowest mean. However, these settings provide little guidance on designing a policy to minimize the single smallest cost. The extreme bandit problem, where we care not about the average cost but about the single minimal cost, has been significantly less studied.

The extreme bandit problem (also called the max $K$-armed bandit problem) is introduced in Cicirello and Smith (2005) and further developed in Streeter and Smith (2006a,b). The problem is additionally studied in Carpentier and Valko (2014), where the authors give an explicit algorithm and prove that it exhibits asymptotically no regret in the sense of Equation 1. However, all results in previous work have relied heavily on strong parametric or semiparametric assumptions on the distributions $\mu_1^K$ under consideration. Motivated by extreme value theory, Cicirello and Smith (2005) assume that the distributions belong to the Gumbel family and Carpentier and Valko (2014) consider distributions in the Frechet family (or distributions that are well approximated by the Frechet family). When the individual samples arise as the maxima of a large number of independent, identically distributed random variables, then these assumptions may be realistic. These assumptions dramatically simplify the problem. As in the multi-armed bandit setting, where every sample from a distribution provides information about the mean of the distribution, in the parametric setting, every sample provides information about the parameters of the distribution. Once we have accurately estimated each distribution, we can make sensible choices about which distribution to choose. Our work shows that some form of assumption is necessary to improve on the guarantees of the policy that chooses each arm equally often.


The no free lunch theorems are another form 
of hardness result in the optimization setting. 
Wolpert and Macreadv (1997) show that in a discrete 


setting, all optimization algorithms that never revisit 
the same point perform equally well in expectation 
with respect to the uniform distribution over all pos¬ 
sible objectives. 


6 Discussion 

We have shown that a number of subtleties arise in the extreme bandit setting that are not present in the standard bandit setting. These include the fact that there is no well-defined “best” arm and the fact that strategies that play multiple arms can outperform oracle strategies that play a single arm. We have shown that no policy can be guaranteed to perform asymptotically as well as an oracle that plays the single best arm for a given time horizon. This result should not be construed to say that no policy can do better in practice. Indeed, hyperparameter optimization problems in the real world possess many nice structural properties. For instance, many hyperparameters have a sweet spot outside of which the algorithm performs poorly. This suggests that many black-box objectives for hyperparameter optimization may exhibit coordinate-wise quasiconvexity. Crafting plausible assumptions on the objectives and understanding how they translate into conditions on the induced distributions over algorithm performance is an important problem.
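To make the notion of a "sweet spot" concrete, the following sketch tests whether costs measured along a single hyperparameter, at increasing settings, are quasiconvex. The function name `is_quasiconvex_1d` and the example values are hypothetical, and the check only applies to a finite grid of evaluations along one coordinate.

```python
from typing import Sequence


def is_quasiconvex_1d(values: Sequence[float]) -> bool:
    """Return True if the sequence never rises above both a value to its left
    and a value to its right, i.e. it decreases to a sweet spot and then increases."""
    n = len(values)
    for j in range(1, n - 1):
        lowest_left = min(values[:j])
        lowest_right = min(values[j + 1:])
        # Quasiconvexity fails if an interior point is strictly worse than
        # some point on each side of it.
        if values[j] > lowest_left and values[j] > lowest_right:
            return False
    return True


# Hypothetical validation errors along one hyperparameter (e.g. a learning rate grid).
print(is_quasiconvex_1d([0.9, 0.4, 0.2, 0.3, 0.8]))  # single sweet spot
print(is_quasiconvex_1d([0.5, 0.2, 0.6, 0.1, 0.7]))  # two separate dips
```

A coordinate-wise version would apply the same check to each hyperparameter in turn while holding the others fixed.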

Acknowledgements 

We would like to thank Balazs Kegl for valuable discussions. We would like to thank Kevin Jamieson and Ilya Tolstikhin for their feedback on earlier drafts of this paper.


References 

J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281-305, 2012.


We do not expect the parametric assumptions motivated by extreme value theory to make sense in the setting of hyperparameter optimization. However, determining which realistic assumptions are likely to hold in practice for hyperparameter optimization is an important question.


More recently, David and Shimkin (2015) consider a PAC setting for the extreme bandit problem and prove a lower bound on the sample complexity of algorithms that return an answer within $\epsilon$ of the optimal attainable value with probability $1 - \delta$.


J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546-2554, 2011.

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832-1852, 2011.

A. Carpentier and M. Valko. Extreme bandits. In Advances in Neural Information Processing Systems, pages 1089-1097, 2014.

V. A. Cicirello and S. F. Smith. The max K-armed bandit: A new model of exploration applied to search heuristic selection. In Proceedings of the National Conference on Artificial Intelligence, 2005.

Y. David and N. Shimkin. The max k-armed bandit: A PAC lower bound and tighter algorithms. arXiv preprint arXiv:1508.05608, 2015.

J. Duchi, M. Jordan, M. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788-2806, 2015.

N. Hansen. The CMA evolution strategy: a comparing review. In Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms, pages 75-102. Springer, 2006.

J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright. Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM Journal on Optimization, 9(1):112-147, 1998.

M. Powell. The NEWUOA software for unconstrained optimization without derivatives. In Large-Scale Nonlinear Optimization, volume 83, pages 255-297. Springer, 2006.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.

M. J. Streeter and S. F. Smith. An asymptotically optimal algorithm for the max K-armed bandit problem. In Proceedings of the National Conference on Artificial Intelligence, 2006a.

M. J. Streeter and S. F. Smith. A simple distribution-free approach to the max K-armed bandit problem. In Principles and Practice of Constraint Programming-CP 2006, pages 560-574, 2006b.

Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. de Freitas. Bayesian optimization in high dimensions via random embeddings. In International Joint Conferences on Artificial Intelligence, 2013.

D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67-82, 1997.


Robert Nishihara, David Lopez-Paz, Leon Bottou


A The Best Arm Depends on the Time Horizon

In Example 9, we considered an infinite collection of arms $\mu_s$ indexed by $0 < s < 1$. Samples $x_{s,t}$ from $\mu_s$ satisfy $P(x_{s,t} = s) = s$ and $P(x_{s,t} = 1) = 1 - s$. We claimed that for a time horizon of $T$, the optimal $s$ is $\Theta((\log T)/T)$.

We have

$$E\Big[\min_{t \le T} x_{s,t}\Big] = s\big(1 - (1-s)^T\big) + 1 \cdot (1-s)^T = s + (1-s)^{T+1}.$$

Let $s^*$ be the index of the optimal distribution, so $\min_s E[\min_{t \le T} x_{s,t}] = s^* + (1 - s^*)^{T+1}$. For large $T$, we can restrict attention to small values of $s$, for which we have

$$s + e^{-2s(T+1)} \le s + (1-s)^{T+1} \le s + e^{-s(T+1)}.$$

It follows that

$$s^* + e^{-2s^*(T+1)} \le \min_s E\Big[\min_{t \le T} x_{s,t}\Big] \le \min_s \big(s + e^{-s(T+1)}\big) \le \frac{\log T}{T+1} + \frac{1}{T} \le \frac{2 \log T}{T}.$$

Therefore, $s^* \le (2 \log T)/T$ and $e^{-2s^*(T+1)} \le (2 \log T)/T$. The latter implies that

$$s^* \ge \frac{-\log 2 - \log\log T + \log T}{2(T+1)}.$$

These results imply that $s^*$ is $\Theta((\log T)/T)$.
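The bounds on $s^*$ can be sanity-checked numerically. The sketch below assumes the closed form $E[\min_{t \le T} x_{s,t}] = s + (1-s)^{T+1}$ from this appendix; the function names are hypothetical, and the expression used for the minimizer comes from setting the derivative $1 - (T+1)(1-s)^T$ to zero, which is a step added here rather than one stated in the text.

```python
import math
from typing import Tuple


def expected_min(s: float, T: int) -> float:
    """Expected minimum of T draws that equal s with probability s and 1 otherwise."""
    return s + (1.0 - s) ** (T + 1)


def optimal_s(T: int) -> float:
    """Minimizer of expected_min over 0 < s < 1, from (1 - s)^T = 1 / (T + 1)."""
    return 1.0 - (T + 1) ** (-1.0 / T)


def s_star_bounds(T: int) -> Tuple[float, float]:
    """Lower and upper bounds on the optimal s derived in this appendix."""
    lower = (math.log(T) - math.log(2) - math.log(math.log(T))) / (2 * (T + 1))
    upper = 2 * math.log(T) / T
    return lower, upper


for T in (100, 1000, 10000):
    lower, upper = s_star_bounds(T)
    s_star = optimal_s(T)
    # The last column shows s* relative to (log T) / T.
    print(T, lower, s_star, upper, s_star * T / math.log(T))
```

For these horizons the minimizer lies between the two bounds, and its ratio to $(\log T)/T$ stays close to a constant, in line with the $\Theta((\log T)/T)$ claim.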


B Proof of Lemma 17

Here we state and prove Lemma 17, which is used in the proof of Lemma 14. The goal of Lemma 17 is to show that the probability of a particular sequence of values under the tuple $\mu^T(b^{k'})$, when averaged over the possible values of $k'$, is at least as great (up to a constant $c$) as the probability of the same sequence of values under the averaged tuple $\eta^T$. Since all values other than the values $1$ and $a_i$ have equal probability under all tuples (for $j \ne i$, the value $a_j$ has probability $\alpha_j$ under the $b_j$th element of each tuple), this lemma focuses on the probabilities of the values that equal $1$. Recall that $\gamma_k(b^{k'})$ is the probability of obtaining a value of $1$ from $\mu_k(b^{k'})$ and $\gamma_k(\bar{b})$ is the probability of obtaining a value of $1$ from $\eta_k$.

Lemma 17. For integers $n_1, \dots, n_K \ge 0$ such that $\sum_k n_k \le T$, we have

$$\frac{1}{K} \sum_{k'=1}^{K} \prod_{k=1}^{K} \gamma_k(b^{k'})^{n_k} \ge c \prod_{k=1}^{K} \gamma_k(\bar{b})^{n_k},$$

where $c = e^{-\frac{2 \alpha_i T}{iK}}$.

Proof. This result nearly follows from Jensen’s inequality. Indeed, if the function

$$f(c_1, \dots, c_K) = \prod_{k=1}^{K} \Big(1 - c_k \alpha_i - \sum_{j \ne i} \mathbb{1}[b_j = k]\, \alpha_j\Big)^{n_k}$$

were convex, then the result would follow from a single application of Jensen’s inequality. That is, the result with $c = 1$ is precisely the statement

$$\frac{f(1, 0, \dots, 0) + \cdots + f(0, \dots, 0, 1)}{K} \ge f\Big(\frac{1}{K}, \dots, \frac{1}{K}\Big).$$

Unfortunately, despite the fact that $f$ is the product of convex functions (over the relevant domains), $f$ itself is not convex. To circumvent this difficulty, we will approximate each term with the exponential of an affine function, so that the product of approximations remains convex (because the affine functions simply add). As our approximation is imperfect, we pick up a penalty in the form of the constant $c$. Let

$$\omega_k = 1 - \sum_{j \ne i} \mathbb{1}[b_j = k]\, \alpha_j \quad \text{and} \quad \beta_{i,k} = \frac{\alpha_i}{\omega_k}.$$

First write

$$\frac{1}{K} \sum_{k'=1}^{K} \prod_{k=1}^{K} \gamma_k(b^{k'})^{n_k} = \frac{1}{K} \sum_{k'=1}^{K} \prod_{k=1}^{K} \big(\omega_k - \mathbb{1}[k = k']\, \alpha_i\big)^{n_k} = \Big(\prod_{k=1}^{K} \omega_k^{n_k}\Big) \frac{1}{K} \sum_{k'=1}^{K} (1 - \beta_{i,k'})^{n_{k'}}. \tag{12}$$

Note that, by an earlier lemma, we have $\omega_{k'} \ge \frac{1}{2}$ and so $\beta_{i,k'} \le 2 \alpha_i$. It follows from Lemma 18 and an earlier lemma that we can write

$$\frac{1}{K} \sum_{k'=1}^{K} (1 - \beta_{i,k'})^{n_{k'}} \ge \frac{1}{K} \sum_{k'=1}^{K} e^{-(1 + \frac{1}{i}) \beta_{i,k'} n_{k'}} \ge e^{-(1 + \frac{1}{i}) \frac{1}{K} \sum_{k'=1}^{K} \beta_{i,k'} n_{k'}} \ge e^{-\frac{2 \alpha_i T}{iK}}\, e^{-\frac{1}{K} \sum_{k'=1}^{K} \beta_{i,k'} n_{k'}} \ge e^{-\frac{2 \alpha_i T}{iK}} \prod_{k'=1}^{K} \Big(1 - \frac{\beta_{i,k'}}{K}\Big)^{n_{k'}}. \tag{13}$$

The second inequality is Jensen’s inequality. The third inequality breaks the $1 + \frac{1}{i}$ term into two terms and uses the bounds $\beta_{i,k'} \le 2 \alpha_i$ and $\sum_{k'} n_{k'} \le T$. The fourth inequality uses the fact that $e^{-x} \ge 1 - x$. Combining Equation 12 and Equation 13 gives

$$\frac{1}{K} \sum_{k'=1}^{K} \prod_{k=1}^{K} \gamma_k(b^{k'})^{n_k} \ge e^{-\frac{2 \alpha_i T}{iK}} \prod_{k=1}^{K} \omega_k^{n_k} \Big(1 - \frac{\beta_{i,k}}{K}\Big)^{n_k} = e^{-\frac{2 \alpha_i T}{iK}} \prod_{k=1}^{K} \gamma_k(\bar{b})^{n_k},$$

which finishes the proof. □

C Upper Bound on Exponential


Throughout this paper, we make use of the inequality $e^{-x} \ge 1 - x$. However, on a couple of occasions, we need to lower bound $1 - x$ by an exponential of the form $e^{-rx}$ for some constant $r$. The bound that we use is given in Lemma 18.

Lemma 18. For $i \ge 1$ and $y \in \big[0, \frac{1}{2(i+1)}\big]$, we have

$$e^{-(1 + \frac{1}{i}) y} \le 1 - y.$$

Proof. More generally, the convexity of $e^{-x}$ implies that for $0 \le x \le c$, we have

$$e^{-x} \le 1 - \frac{1 - e^{-c}}{c}\, x.$$

The right hand side is the formula for the line interpolating between the points $(0, 1)$ and $(c, e^{-c})$ on the graph of $e^{-x}$. Choosing $c = \log(1 + \frac{1}{i})$, and noting that $0 \le x \le \frac{1}{1+i}$ implies that $0 \le x \le c$ because of the standard inequality $1 - \frac{1}{s} \le \log s$, we see that $0 \le x \le \frac{1}{1+i}$ implies that

$$e^{-x} \le 1 - \frac{1 - \frac{i}{i+1}}{\log(1 + \frac{1}{i})}\, x \le 1 - \frac{\frac{1}{1+i}}{\frac{1}{i}}\, x = 1 - \frac{i}{1+i}\, x.$$

Setting $y = \frac{i}{1+i} x$ and using the fact that $\frac{1}{2(i+1)} \le \frac{i}{(1+i)^2}$ gives the result. □
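A quick numerical check of this bound is sketched below. It assumes that Lemma 18 reads $e^{-(1 + 1/i) y} \le 1 - y$ for $i \ge 1$ and $0 \le y \le \frac{1}{2(i+1)}$; the function names `gap` and `lemma18_holds`, the grid size, and the small floating-point tolerance are choices made for this illustration only.

```python
import math


def gap(i: int, y: float) -> float:
    """Return (1 - y) - exp(-(1 + 1/i) * y); nonnegative when the bound holds."""
    return (1.0 - y) - math.exp(-(1.0 + 1.0 / i) * y)


def lemma18_holds(i: int, samples: int = 1000) -> bool:
    """Check the bound on an evenly spaced grid over [0, 1 / (2 (i + 1))]."""
    upper = 1.0 / (2 * (i + 1))
    for n in range(samples + 1):
        y = upper * n / samples
        if gap(i, y) < -1e-15:  # tolerance for rounding at y = 0
            return False
    return True


for i in (1, 2, 5, 100):
    print(i, lemma18_holds(i))

# Outside the stated range the bound can fail, e.g. i = 1 and y = 0.9.
print(gap(1, 0.25), gap(1, 0.9))
```

The final line illustrates why the restriction on $y$ matters: for $i = 1$ the gap is positive at $y = 0.25$ but negative at $y = 0.9$.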







